We will go through the following process: Explore data -> Fit model -> Evaluate model -> Deploy model
# Import relevant libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
import missingno
Perform exploratory analysis on the variables using the whole data set. Describe the data and comment on your observations/findings.
The response variable in this dataset is likely to be diabetes, which is a binary variable indicating whether the participant has diabetes (often derived from the glycohemoglobin measurements, typically GH >= 6.5%).
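A minimal sketch of that derivation (the 6.5% cutoff is the conventional diagnostic threshold; the values here are made up for illustration):

```python
import pandas as pd

# Hypothetical illustration: deriving a binary diabetes flag from
# glycohemoglobin (gh), using the conventional 6.5% diagnostic cutoff.
gh = pd.Series([5.2, 6.7, 5.9, 7.1])
diabetes = (gh >= 6.5).astype(int)
print(diabetes.tolist())  # → [0, 1, 0, 1]
```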
#df = pd.read_excel('Glycohemoglobin_t1.xlsx')
#df = pd.read_excel('Glycohemoglobin_t3.xlsx')
#df = pd.read_excel('Glycohemoglobin_t4.xlsx')
df = pd.read_excel('Glycohemoglobin_t4_ShortName.xlsx')
df.head()
| | diabetes | Sex | Age | Race_Or_Ethnicity | Income_Min | Income_Max | On_Insulin_or_Diabetes_Meds | Weight_Kg | Height_cm | BMI | Up_Leg_Len | Up_Arm_Len | Arm_Cir | Waist_Cir | Triceps_Sk | Subscapular_Sk | Alibumin | Blood_Urea | Creatinine | gh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 34.2 | 3 | 25000.0 | 35000.0 | 0 | 87.4 | 164.7 | 32.22 | 41.5 | 40.0 | 36.4 | 100.4 | 16.4 | 24.9 | 4.8 | 6.0 | 0.94 | 5.2 |
| 1 | 0 | 1 | 16.8 | 2 | 45000.0 | 55000.0 | 0 | 72.3 | 181.3 | 22.00 | 42.0 | 39.5 | 26.6 | 74.7 | 10.2 | 10.5 | 4.6 | 9.0 | 0.89 | 5.7 |
| 2 | 1 | 0 | 60.2 | 2 | 10000.0 | 15000.0 | 1 | 116.8 | 166.0 | 42.39 | 35.3 | 39.0 | 42.2 | 118.2 | 29.6 | 35.6 | 3.9 | 10.0 | 1.11 | 6.0 |
| 3 | 0 | 1 | 26.1 | 1 | 25000.0 | 35000.0 | 0 | 97.6 | 173.0 | 32.61 | 41.7 | 38.7 | 37.0 | 103.7 | 19.0 | 23.2 | 4.2 | 8.0 | 0.80 | 5.1 |
| 4 | 0 | 0 | 49.7 | 3 | 35000.0 | 45000.0 | 0 | 86.7 | 168.4 | 30.57 | 37.5 | 36.1 | 33.3 | 107.8 | 30.3 | 28.0 | 4.3 | 13.0 | 0.79 | 5.3 |
missingno.matrix(df, figsize = (30,10))
<Axes: >
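Besides the missingno matrix, missing values can be tabulated directly; a minimal sketch on toy data (column names chosen to mimic the dataset, values hypothetical):

```python
import numpy as np
import pandas as pd

# Count missing values per column, sorted so the worst columns come first.
toy = pd.DataFrame({
    'Income_Max': [35000.0, np.nan, 15000.0, np.nan],
    'BMI': [32.2, 22.0, np.nan, 30.6],
    'gh': [5.2, 5.7, 6.0, 5.3],
})
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)  # Income_Max has 2 missing, BMI has 1, gh has 0
```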
# To check data types, type: df.dtypes
df.dtypes
diabetes                         int64
Sex                              int64
Age                            float64
Race_Or_Ethnicity                int64
Income_Min                     float64
Income_Max                     float64
On_Insulin_or_Diabetes_Meds      int64
Weight_Kg                      float64
Height_cm                      float64
BMI                            float64
Up_Leg_Len                     float64
Up_Arm_Len                     float64
Arm_Cir                        float64
Waist_Cir                      float64
Triceps_Sk                     float64
Subscapular_Sk                 float64
Alibumin                       float64
Blood_Urea                     float64
Creatinine                     float64
gh                             float64
dtype: object
df.shape
(6795, 20)
df.describe(include='all')
| | diabetes | Sex | Age | Race_Or_Ethnicity | Income_Min | Income_Max | On_Insulin_or_Diabetes_Meds | Weight_Kg | Height_cm | BMI | Up_Leg_Len | Up_Arm_Len | Arm_Cir | Waist_Cir | Triceps_Sk | Subscapular_Sk | Alibumin | Blood_Urea | Creatinine | gh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6795.000000 | 6795.000000 | 6795.000000 | 6795.000000 | 6400.000000 | 5366.000000 | 6795.000000 | 6795.000000 | 6795.000000 | 6795.000000 | 6564.000000 | 6616.000000 | 6607.000000 | 6556.000000 | 6314.000000 | 5824.000000 | 6706.000000 | 6706.000000 | 6706.000000 | 6795.000000 |
| mean | 0.134511 | 0.496247 | 44.293304 | 2.637233 | 41262.500000 | 41764.815505 | 0.091832 | 79.370625 | 167.042958 | 28.321741 | 38.409324 | 36.874607 | 32.485152 | 96.254149 | 18.787726 | 19.961556 | 4.273621 | 12.917686 | 0.878627 | 5.676586 |
| std | 0.341225 | 0.500023 | 20.593529 | 1.088434 | 31013.885149 | 27238.504068 | 0.288810 | 21.930903 | 10.264984 | 6.950110 | 3.876902 | 2.781616 | 5.297660 | 17.059193 | 8.319393 | 8.369083 | 0.326545 | 5.717571 | 0.445238 | 0.964700 |
| min | 0.000000 | 0.000000 | 12.000000 | 1.000000 | 0.000000 | 5000.000000 | 0.000000 | 28.000000 | 123.300000 | 13.180000 | 20.400000 | 24.800000 | 16.800000 | 52.000000 | 2.600000 | 3.800000 | 2.500000 | 1.000000 | 0.140000 | 4.000000 |
| 25% | 0.000000 | 0.000000 | 25.700000 | 2.000000 | 20000.000000 | 20000.000000 | 0.000000 | 64.000000 | 159.600000 | 23.430000 | 36.000000 | 35.000000 | 28.850000 | 83.500000 | 12.000000 | 13.000000 | 4.100000 | 9.000000 | 0.700000 | 5.200000 |
| 50% | 0.000000 | 0.000000 | 43.800000 | 3.000000 | 35000.000000 | 35000.000000 | 0.000000 | 76.300000 | 166.600000 | 27.290000 | 38.400000 | 36.800000 | 32.100000 | 95.300000 | 17.900000 | 19.400000 | 4.300000 | 12.000000 | 0.830000 | 5.500000 |
| 75% | 0.000000 | 1.000000 | 61.300000 | 3.000000 | 65000.000000 | 55000.000000 | 0.000000 | 91.100000 | 174.500000 | 31.880000 | 41.000000 | 38.800000 | 35.600000 | 106.900000 | 25.000000 | 26.200000 | 4.500000 | 15.000000 | 0.980000 | 5.800000 |
| max | 1.000000 | 1.000000 | 80.000000 | 5.000000 | 100000.000000 | 100000.000000 | 1.000000 | 239.400000 | 202.700000 | 84.870000 | 50.600000 | 47.000000 | 61.000000 | 179.000000 | 41.100000 | 40.400000 | 5.300000 | 90.000000 | 15.660000 | 16.400000 |
# To visualise correlation matrix in a heatmap, type: sns.heatmap(df.corr(), annot=True, cmap='coolwarm');
# sns.heatmap(df.corr(), annot=True, cmap='coolwarm');
plt.figure(figsize=(12, 10)) # Adjust the figure size
heatmap = sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f", annot_kws={"size": 8})
heatmap.set_xticklabels(heatmap.get_xticklabels(), rotation=45, horizontalalignment='right', fontsize=8)
heatmap.set_yticklabels(heatmap.get_yticklabels(), fontsize=8)
plt.show()
def correlation_heatmap(df):
    plt.figure(figsize=(16, 14))
    colormap = sns.diverging_palette(220, 10, as_cmap=True)
    sns.heatmap(
        df.corr(),
        cmap=colormap,
        square=True,
        cbar_kws={'shrink': 0.9},
        annot=True,
        linewidths=0.9,
        vmax=1.0,
        linecolor='white',
        annot_kws={'fontsize': 9}
    )
    plt.title('Pearson Correlation of Features', y=1.05, size=15)
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.show()
# Assuming 'df' is your DataFrame
correlation_heatmap(df)
The heatmap displays the Pearson correlation coefficients between the features in the dataset. Key observations:
Diagonal Elements: All diagonal elements are 1, since each feature is perfectly correlated with itself.
Target Variable (diabetes): the strongest associations with diabetes are On_Insulin_or_Diabetes_Meds and gh, which is expected given how the diabetes flag is defined.
High Positive Correlations: the anthropometric measures (Weight_Kg, BMI, Arm_Cir, Waist_Cir and the skinfold measurements) are strongly correlated with one another, and Income_Min moves almost in lockstep with Income_Max.
High Negative Correlations: no strong negative correlations stand out; the negative entries are mostly modest ones involving Sex or Height_cm.
General Observations: the block of highly correlated body measurements is a warning sign for multicollinearity, which we check later with the condition number and VIFs.
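Reading precise values off a dense heatmap is awkward; the correlations with the response can also be pulled out numerically. A sketch on toy data (column names mimic the dataset, values hypothetical):

```python
import pandas as pd

# Rank predictors by their Pearson correlation with the response column.
toy = pd.DataFrame({
    'diabetes': [0, 0, 1, 0, 1, 1],
    'gh':       [5.2, 5.5, 7.0, 5.4, 6.8, 7.4],
    'Sex':      [1, 0, 1, 0, 1, 0],
})
corr_with_target = toy.corr()['diabetes'].drop('diabetes').sort_values(ascending=False)
print(corr_with_target)  # gh should rank far above Sex here
```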
# To visualise pairwise relationship using Seaborn pairplot, type: sns.pairplot(df);
# Caution: It may take some time to generate a pairplot, especially when number of features is large.
sns.pairplot(df);
The pairplot or scatterplot matrix is used to visualize the pairwise relationships between multiple variables in a dataset. Here’s a brief description of the plot:
Diagonal: The diagonal of the pair plot typically contains histograms or kernel density estimates (KDEs) of each variable, showing the distribution of individual variables.
Upper Triangle: The upper triangle contains scatter plots of the pairwise combinations of variables. These plots help to identify relationships, correlations, or patterns between pairs of variables.
Lower Triangle: The lower triangle is usually a mirror image of the upper triangle, showing the same scatter plots for pairwise combinations of variables.
Axes: Each row and column represent a different variable from the dataset. The axes labels indicate the variable names.
Distribution of Individual Variables: the diagonal histograms show that most body measurements (e.g. Weight_Kg, BMI) are right-skewed, and gh is sharply peaked with a long right tail.
Linear Relationships: strong positive linear trends appear among the anthropometric variables, e.g. Weight_Kg against BMI, Arm_Cir and Waist_Cir.
Non-linear Relationships: some pairs (e.g. the skinfold measures against BMI) show curvature rather than a straight-line trend.
Clusters: there is no obvious multi-group clustering beyond the banding created by the categorical variables.
Outliers: isolated extreme points are visible for Creatinine, Blood_Urea and gh.
Discrete Variables: the binary/categorical columns (diabetes, Sex, Race_Or_Ethnicity, On_Insulin_or_Diabetes_Meds) produce horizontal and vertical bands instead of scatter clouds.
Symmetry: the lower triangle mirrors the upper triangle, so each relationship only needs to be read once.
Overall, the pairplot gives a visual way to detect linear and non-linear relationships, clusters and outliers across many numerical variables at once, and these observations help guide further analysis and modelling.
Split the data set into training set and testing set in a (approximate) ratio 75:25. Set random state/seed using the last 4 digits of your SP admission number. Fit the full additive MLR model on the training set.
# Split data set into 75:25.
from sklearn.model_selection import train_test_split
dftrain, dftest = train_test_split(df, test_size=0.25, random_state=1938)
# To confirm the split ratio:
print( len(dftrain)/len(df) )
print( len(dftest)/len(df) )
0.749963208241354
0.25003679175864607
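The same kind of 75:25 split can be reproduced with plain pandas sampling, which is handy when sklearn is unavailable (a sketch on toy data; the seed is arbitrary):

```python
import pandas as pd

# A 75:25 split using DataFrame.sample; the remaining rows form the test set.
toy = pd.DataFrame({'x': range(100), 'y': range(100)})
train = toy.sample(frac=0.75, random_state=1938)
test = toy.drop(train.index)
print(len(train), len(test))  # → 75 25
```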
#mlr_full = ols("diabetes ~ Sex + Age + Race_Or_Ethnicity + Income_Min + Income_Max + On_Insulin_or_Diabetes_Meds + Weight_Kg + Height_cm + BMI + Upper_Leg_Length_cm + Upper_Arm_Length_cm + Arm_Circumference_cm + Waist_Circumference_cm + Triceps_Skinfold_mm + Subscapular_Skinfold_mm + Albumin_g-dL + Blood_urea_nitrogen_mg-dl + Creatinine_mg-dl + gh#", dftrain).fit()
#mlr_full = ols(
# "diabetes ~ Sex + Age + Race_Or_Ethnicity + Income_Min + Income_Max + On_Insulin_or_Diabetes_Meds + "
# "Weight_Kg + Height_cm + BMI + Upper_Leg_Length_cm + Upper_Arm_Length_cm + Arm_Circumference_cm + "
# "Waist_Circumference_cm + Triceps_Skinfold_mm + Subscapular_Skinfold_mm + Albumin_g_dL + "
# "Blood_urea_nitrogen_mg_dl + Creatinine_mg_dl + gh", dftrain
#).fit()
formula = (
"diabetes ~ Sex + Age + Race_Or_Ethnicity + Income_Min + Income_Max + On_Insulin_or_Diabetes_Meds + Weight_Kg + Height_cm + BMI + Up_Leg_Len + Up_Arm_Len + Arm_Cir + Waist_Cir + Triceps_Sk + Subscapular_Sk + Alibumin + Blood_Urea + Creatinine + gh"
)
mlr_full = ols(formula, dftrain).fit()
print(mlr_full.summary())
print('\nMSE =', mlr_full.mse_resid)
OLS Regression Results
==============================================================================
Dep. Variable: diabetes R-squared: 0.611
Model: OLS Adj. R-squared: 0.609
Method: Least Squares F-statistic: 265.9
Date: Thu, 25 Jul 2024 Prob (F-statistic): 0.00
Time: 11:40:06 Log-Likelihood: 608.35
No. Observations: 3238 AIC: -1177.
Df Residuals: 3218 BIC: -1055.
Df Model: 19
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 0.1418 0.283 0.502 0.616 -0.412 0.696
Sex -0.0085 0.012 -0.707 0.480 -0.032 0.015
Age 0.0006 0.000 2.414 0.016 0.000 0.001
Race_Or_Ethnicity 0.0021 0.003 0.631 0.528 -0.004 0.009
Income_Min 2.577e-06 1.4e-06 1.844 0.065 -1.63e-07 5.32e-06
Income_Max -2.19e-06 1.14e-06 -1.924 0.054 -4.42e-06 4.15e-08
On_Insulin_or_Diabetes_Meds 0.8028 0.016 48.899 0.000 0.771 0.835
Weight_Kg 0.0029 0.002 1.592 0.112 -0.001 0.006
Height_cm -0.0019 0.002 -1.091 0.275 -0.005 0.002
BMI -0.0076 0.005 -1.460 0.144 -0.018 0.003
Up_Leg_Len -0.0032 0.002 -1.952 0.051 -0.006 1.51e-05
Up_Arm_Len -0.0021 0.003 -0.807 0.420 -0.007 0.003
Arm_Cir -0.0007 0.002 -0.344 0.731 -0.005 0.003
Waist_Cir 0.0011 0.001 1.430 0.153 -0.000 0.003
Triceps_Sk 0.0002 0.001 0.251 0.802 -0.001 0.002
Subscapular_Sk -0.0002 0.001 -0.329 0.742 -0.002 0.001
Alibumin 0.0013 0.013 0.106 0.916 -0.024 0.026
Blood_Urea -0.0003 0.001 -0.331 0.740 -0.002 0.001
Creatinine 0.0497 0.012 4.273 0.000 0.027 0.073
gh 0.0499 0.005 10.041 0.000 0.040 0.060
==============================================================================
Omnibus: 2480.587 Durbin-Watson: 2.046
Prob(Omnibus): 0.000 Jarque-Bera (JB): 38848.430
Skew: 3.662 Prob(JB): 0.00
Kurtosis: 18.307 Cond. No. 5.07e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.07e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
MSE = 0.04046012122620297
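The MSE above is computed on the training residuals, which is optimistic; prediction accuracy is better judged on held-out data. A minimal sketch of computing test-set MSE with a formula model (toy data, not the diabetes set):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Fit on a training frame, then score mean squared error on a held-out frame.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({'x': x, 'y': 2.0 * x + rng.normal(scale=0.1, size=200)})
train, test = toy.iloc[:150], toy.iloc[150:]

model = ols('y ~ x', train).fit()
pred = model.predict(test)
test_mse = np.mean((test['y'] - pred) ** 2)
print(test_mse)  # should be close to the noise variance, 0.01
```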
Conduct relevant diagnostics on the full MLR model fitted. Evaluate the model from the perspectives of model fit, prediction accuracy, model/predictor significance, and checking of assumptions.
How well does the model fit the data? Is the model likely to be useful for prediction? Are any of the basic assumptions violated?
We can perform the following diagnostics:
# To extract Rsq value, type: modelname.rsquared
mlr_full.rsquared
0.6109219787321869
We can use visualizations or run statistical tests to check if residuals satisfy the normality assumption.
# To produce residual plots, we have to extract the residuals first. Type: residname = modelname.resid
res = mlr_full.resid
len(res)
3238
# Check whether residuals are normally distributed visually in a histogram.
# Type: plt.hist(residname);
plt.hist(res);
# Alternatively, use a distribution plot (a histogram with a fitted density curve).
# Note: seaborn's distplot is deprecated; use histplot with kde=True instead.
sns.histplot(res, kde=True);
The plot is a histogram of the residuals with an overlaid kernel density estimate (KDE). Observations:
Distribution: the bulk of the residuals is concentrated in a tall, narrow spike around 0.
Skewness: the residuals are strongly right-skewed (the model summary reports skew of about 3.66).
Density: the KDE drops off sharply away from 0, with very low density across the rest of the range.
Outliers: a long right tail extends towards 1, with a few residuals reaching about -1.
Central Tendency: the residuals are centred at 0, as expected for OLS residuals.
Bimodal Nature: a small secondary bump near 1 comes from the binary response; observations with diabetes = 1 but low fitted values leave residuals close to 1.
Data Spread: apart from the tail, most residuals lie within a narrow band around 0.
The secondary peak is a symptom of fitting a linear model to a 0/1 response rather than a feature of the underlying data. This shape already suggests the normality assumption is violated, which can be confirmed with formal tests.
# Construct a QQ plot of residuals. Type: sm.qqplot(residname, line='s');
sm.qqplot(res, line='s');
To check normality of residuals:
The QQ plot shows large deviations from the reference line in the tails, so the errors are not normally distributed.
sm.stats.jarque_bera(res)
(38848.42964160138, 0.0, 3.661758118458597, 18.30718780710834)
sm.stats.omni_normtest(res)
NormaltestResult(statistic=2480.587237064726, pvalue=0.0)
from scipy.stats import shapiro
shapiro(res)
ShapiroResult(statistic=0.4599756598472595, pvalue=0.0)
from scipy.stats import anderson
anderson(res)
AndersonResult(statistic=601.5140106083691, critical_values=array([0.575, 0.655, 0.786, 0.917, 1.091]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]), fit_result= params: FitParams(loc=-4.4162600966201226e-13, scale=0.20055581564425068) success: True message: '`anderson` successfully fit the distribution to the data.')
sm.stats.normal_ad(res)
(601.5140106083645, 0.0)
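All of these tests agree because each of them firmly rejects normality for a clearly skewed sample. A sketch of the same behaviour on synthetic data (the exponential distribution is an arbitrary choice of a right-skewed distribution):

```python
import numpy as np
from scipy.stats import shapiro, normaltest

# A right-skewed (exponential) sample should be firmly rejected as normal.
rng = np.random.default_rng(0)
skewed = rng.exponential(size=500)

stat_s, p_s = shapiro(skewed)
stat_n, p_n = normaltest(skewed)
print(p_s, p_n)  # both p-values fall far below 0.05: reject normality
```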
We can use visualizations or run statistical tests to check if the residuals satisfy the homoscedasticity assumption.
sns.residplot(x=mlr_full.fittedvalues, y=res, lowess=True)
plt.title('Residual Plot')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.show()
The points fall on two distinct downward-sloping lines, one for each value of the binary response, which shows that a linear model is inappropriate here.
# Perform the Breusch-Pagan test on residuals. Type: sm.stats.het_breuschpagan(residname, modelname.model.exog)
# The function returns: Lagrange multiplier statistic, p-value, F-value for B-P Constant Variance, F p-value
sm.stats.het_breuschpagan(res, mlr_full.model.exog)
(172.06439135842388, 1.120159231310755e-26, 9.50518145313247, 1.7492470076191728e-27)
# Plot a line chart of the residuals, according to their observed order, to visually check independence.
# Type:
plt.plot(res);
# Add in titles and axes labels
plt.title('Residuals vs Observation Order')
plt.xlabel('Observation Order')
plt.ylabel('Residuals');
sm.stats.durbin_watson(res)
# no autocorrelation
2.045914204347445
We can check for multicollinearity in the data set by:
| VIF | Interpretation |
|---|---|
| VIF = 1 | Not correlated |
| 1<VIF<5 | Moderately correlated |
| VIF > 5 | Highly correlated |
# Construct a heatmap to check correlations between predictors.
plt.figure(figsize=(12, 10)) # Adjust the figure size
heatmap = sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f", annot_kws={"size": 8})
heatmap.set_xticklabels(heatmap.get_xticklabels(), rotation=45, horizontalalignment='right', fontsize=8)
heatmap.set_yticklabels(heatmap.get_yticklabels(), fontsize=8)
plt.show()
The heatmap shows strong correlations among the body measurements and between Income_Min and Income_Max, so we continue checking with the condition number and VIFs.
# To extract condition number, type: modelname.condition_number
mlr_full.condition_number # for the whole model
# Very large: partly because the predictors are on vastly different scales
# (incomes in the tens of thousands vs binary flags), and partly because of
# collinear predictors.
5068760.69800072
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
# Determine the number of predictors
num_predictors = len(mlr_full.model.exog_names)
# Loop through each predictor to calculate VIF
for i in range(num_predictors):
    predictor = mlr_full.model.exog_names[i]
    VIF = vif(mlr_full.model.exog, i)
    print(predictor, VIF)
Intercept 6389.317868861221
Sex 2.8787449608698688
Age 2.2290836842231854
Race_Or_Ethnicity 1.0678220297686123
Income_Min 77.22502762694421
Income_Max 77.21086759190354
On_Insulin_or_Diabetes_Meds 1.4916494583292275
Weight_Kg 88.64068849733286
Height_cm 25.671515942510204
BMI 68.7191015411717
Up_Leg_Len 3.280300323715118
Up_Arm_Len 4.283182038790671
Arm_Cir 7.6693922110635455
Waist_Cir 10.732482118563432
Triceps_Sk 3.7365812389703557
Subscapular_Sk 2.9716903309761515
Alibumin 1.355190950008615
Blood_Urea 1.5609375515170838
Creatinine 1.4694911937463215
gh 1.6017062800324287
Explain how the model is improved after applying each of the techniques.
res.idxmin() ## or, res.idxmax()
229
res[229]
-1.0509495994901645
df.iloc[229,:]
diabetes                        0.00
Sex                             0.00
Age                            70.40
Race_Or_Ethnicity               1.00
Income_Min                  10000.00
Income_Max                  15000.00
On_Insulin_or_Diabetes_Meds     1.00
Weight_Kg                      66.00
Height_cm                     156.60
BMI                            26.91
Up_Leg_Len                     32.00
Up_Arm_Len                     34.60
Arm_Cir                        30.60
Waist_Cir                      94.00
Triceps_Sk                     28.50
Subscapular_Sk                  6.40
Alibumin                        4.20
Blood_Urea                     31.00
Creatinine                      1.44
gh                              8.10
Name: 229, dtype: float64
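Beyond the single largest residual, influence diagnostics such as Cook's distance flag points that materially move the fit; a sketch on toy data (the 4/n cutoff is a common rule of thumb):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Flag high-influence observations via Cook's distance (rule of thumb: 4/n).
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6, 7, 20.0],
                    'y': [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.1, 0.0]})
model = ols('y ~ x', toy).fit()
cooks_d = model.get_influence().cooks_distance[0]
flagged = np.where(cooks_d > 4 / len(toy))[0]
print(flagged)  # the last row (x=20, y=0) should be flagged
```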
df2 = df.drop([229], axis=0)
df2.shape
(6794, 20)
# Split data set into 75:25.
from sklearn.model_selection import train_test_split
dftrain2, dftest2 = train_test_split(df2, test_size=0.25, random_state=1938)
# To confirm the split ratio:
print( len(dftrain2)/len(df2) )
print( len(dftest2)/len(df2) )
0.7499264056520459
0.25007359434795406
print(dftrain2.isna().any())
diabetes                       False
Sex                            False
Age                            False
Race_Or_Ethnicity              False
Income_Min                      True
Income_Max                      True
On_Insulin_or_Diabetes_Meds    False
Weight_Kg                      False
Height_cm                      False
BMI                            False
Up_Leg_Len                      True
Up_Arm_Len                      True
Arm_Cir                         True
Waist_Cir                       True
Triceps_Sk                      True
Subscapular_Sk                  True
Alibumin                        True
Blood_Urea                      True
Creatinine                      True
gh                             False
dtype: bool
missingno.matrix(dftrain2, figsize = (30,10))
<Axes: >
formula = (
"diabetes ~ Sex + Age + Race_Or_Ethnicity + Income_Min + Income_Max + On_Insulin_or_Diabetes_Meds + Weight_Kg + Height_cm + BMI + Up_Leg_Len + Up_Arm_Len + Arm_Cir + Waist_Cir + Triceps_Sk + Subscapular_Sk + Alibumin + Blood_Urea + Creatinine + gh"
)
mlr_full_2 = ols(formula, dftrain2).fit()  # fit on the training split of df2 (outlier removed)
print(mlr_full_2.summary())
print('\nMSE =', mlr_full_2.mse_resid)
#formula = (
# "diabetes ~ Sex + Age + On_Insulin_or_Diabetes_Meds + Height_cm + BMI + Arm_Cir + Waist_Cir + Alibumin + Blood_Urea + Creatinine + gh"
#)
#mlr_full_2 = ols(formula, dftrain).fit()
#print(mlr_full_2.summary())
#print('\nMSE =', mlr_full_2.mse_resid)
OLS Regression Results
==============================================================================
Dep. Variable: diabetes R-squared: 0.611
Model: OLS Adj. R-squared: 0.609
Method: Least Squares F-statistic: 265.9
Date: Thu, 25 Jul 2024 Prob (F-statistic): 0.00
Time: 12:19:59 Log-Likelihood: 608.35
No. Observations: 3238 AIC: -1177.
Df Residuals: 3218 BIC: -1055.
Df Model: 19
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 0.1418 0.283 0.502 0.616 -0.412 0.696
Sex -0.0085 0.012 -0.707 0.480 -0.032 0.015
Age 0.0006 0.000 2.414 0.016 0.000 0.001
Race_Or_Ethnicity 0.0021 0.003 0.631 0.528 -0.004 0.009
Income_Min 2.577e-06 1.4e-06 1.844 0.065 -1.63e-07 5.32e-06
Income_Max -2.19e-06 1.14e-06 -1.924 0.054 -4.42e-06 4.15e-08
On_Insulin_or_Diabetes_Meds 0.8028 0.016 48.899 0.000 0.771 0.835
Weight_Kg 0.0029 0.002 1.592 0.112 -0.001 0.006
Height_cm -0.0019 0.002 -1.091 0.275 -0.005 0.002
BMI -0.0076 0.005 -1.460 0.144 -0.018 0.003
Up_Leg_Len -0.0032 0.002 -1.952 0.051 -0.006 1.51e-05
Up_Arm_Len -0.0021 0.003 -0.807 0.420 -0.007 0.003
Arm_Cir -0.0007 0.002 -0.344 0.731 -0.005 0.003
Waist_Cir 0.0011 0.001 1.430 0.153 -0.000 0.003
Triceps_Sk 0.0002 0.001 0.251 0.802 -0.001 0.002
Subscapular_Sk -0.0002 0.001 -0.329 0.742 -0.002 0.001
Alibumin 0.0013 0.013 0.106 0.916 -0.024 0.026
Blood_Urea -0.0003 0.001 -0.331 0.740 -0.002 0.001
Creatinine 0.0497 0.012 4.273 0.000 0.027 0.073
gh 0.0499 0.005 10.041 0.000 0.040 0.060
==============================================================================
Omnibus: 2480.587 Durbin-Watson: 2.046
Prob(Omnibus): 0.000 Jarque-Bera (JB): 38848.430
Skew: 3.662 Prob(JB): 0.00
Kurtosis: 18.307 Cond. No. 5.07e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.07e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
MSE = 0.04046012122620297
# Perform backward selection based on p-values: remove predictors with p-values above 0.05.
#"diabetes ~ Sex + Age + Income_Max + On_Insulin_or_Diabetes_Meds + Up_Leg_Len + Creatinine + gh"
mlr_full_3 = ols( "diabetes ~ Age + On_Insulin_or_Diabetes_Meds + Height_cm + BMI + Arm_Cir + Waist_Cir + Alibumin + Blood_Urea + Creatinine + gh", dftrain2).fit()
#mlr_full_3 = ols( "diabetes ~ Sex + Age + On_Insulin_or_Diabetes_Meds + Height_cm + BMI + Arm_Cir + Waist_Cir + Alibumin + Blood_Urea + Creatinine + gh", dftrain2).fit()
print(mlr_full_3.summary())
print('\nMSE =', mlr_full_3.mse_resid)
OLS Regression Results
==============================================================================
Dep. Variable: diabetes R-squared: 0.619
Model: OLS Adj. R-squared: 0.618
Method: Least Squares F-statistic: 783.8
Date: Thu, 25 Jul 2024 Prob (F-statistic): 0.00
Time: 12:30:02 Log-Likelihood: 753.62
No. Observations: 4838 AIC: -1485.
Df Residuals: 4827 BIC: -1414.
Df Model: 10
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept -0.1293 0.072 -1.801 0.072 -0.270 0.011
Age 0.0008 0.000 3.934 0.000 0.000 0.001
On_Insulin_or_Diabetes_Meds 0.8307 0.013 63.252 0.000 0.805 0.856
Height_cm -0.0013 0.000 -3.000 0.003 -0.002 -0.000
BMI -0.0018 0.002 -1.014 0.311 -0.005 0.002
Arm_Cir 0.0015 0.002 0.962 0.336 -0.002 0.004
Waist_Cir 0.0014 0.001 2.442 0.015 0.000 0.003
Alibumin 0.0037 0.011 0.348 0.728 -0.017 0.024
Blood_Urea -0.0005 0.001 -0.683 0.495 -0.002 0.001
Creatinine 0.0206 0.009 2.410 0.016 0.004 0.037
gh 0.0361 0.004 9.404 0.000 0.029 0.044
==============================================================================
Omnibus: 3716.076 Durbin-Watson: 1.934
Prob(Omnibus): 0.000 Jarque-Bera (JB): 55056.313
Skew: 3.730 Prob(JB): 0.00
Kurtosis: 17.746 Cond. No. 4.93e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.93e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
MSE = 0.04297481930696086
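The manual elimination steps here can be automated; a sketch of p-value-based backward selection with formula models (toy data and the same 0.05 cutoff; the predictor names are made up):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Repeatedly drop the predictor with the largest p-value above the cutoff.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
    'noise1': rng.normal(size=n),
    'noise2': rng.normal(size=n),
})
toy['y'] = 2 * toy['x1'] - 1.5 * toy['x2'] + rng.normal(size=n)

predictors = ['x1', 'x2', 'noise1', 'noise2']
while True:
    model = ols('y ~ ' + ' + '.join(predictors), toy).fit()
    pvals = model.pvalues.drop('Intercept')
    worst = pvals.idxmax()
    if pvals[worst] <= 0.05 or len(predictors) == 1:
        break
    predictors.remove(worst)

print(predictors)  # the true signals x1 and x2 should survive
```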
# Continue backward selection based on p-values: remove predictors with p-values above 0.05.
mlr_full_3 = ols( "diabetes ~ Age + On_Insulin_or_Diabetes_Meds + BMI + Creatinine + gh", dftrain2).fit()
print(mlr_full_3.summary())
print('\nMSE =', mlr_full_3.mse_resid)
OLS Regression Results
==============================================================================
Dep. Variable: diabetes R-squared: 0.620
Model: OLS Adj. R-squared: 0.620
Method: Least Squares F-statistic: 1640.
Date: Thu, 25 Jul 2024 Prob (F-statistic): 0.00
Time: 12:37:48 Log-Likelihood: 736.42
No. Observations: 5031 AIC: -1461.
Df Residuals: 5025 BIC: -1422.
Df Model: 5
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept -0.2748 0.023 -12.080 0.000 -0.319 -0.230
Age 0.0010 0.000 6.194 0.000 0.001 0.001
On_Insulin_or_Diabetes_Meds 0.8252 0.013 64.801 0.000 0.800 0.850
BMI 0.0019 0.000 4.312 0.000 0.001 0.003
Creatinine 0.0182 0.006 2.871 0.004 0.006 0.031
gh 0.0391 0.004 10.417 0.000 0.032 0.046
==============================================================================
Omnibus: 3810.676 Durbin-Watson: 1.931
Prob(Omnibus): 0.000 Jarque-Bera (JB): 54847.512
Skew: 3.668 Prob(JB): 0.00
Kurtosis: 17.416 Cond. No. 457.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
MSE = 0.043742369327149255
res3 = mlr_full_3.resid
len(res3)
4838
sns.residplot(x=mlr_full_3.fittedvalues, y=res3)
plt.title('Residual Plot')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.show()
# The two-band (non-linear) pattern persists: the residual plot is about the same as before.
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
for i in range(6):
    predictor = mlr_full_3.model.exog_names[i]
    Vif3 = vif(mlr_full_3.model.exog, i)
    print(predictor, Vif3)
Intercept 580.8932588211743
Age 1.8482677447172386
On_Insulin_or_Diabetes_Meds 1.4994464890949895
Height_cm 2.047171850276534
BMI 16.531247060674165
Arm_Cir 7.4651800718503525
# Create interaction terms between variables
#"diabetes ~ Age + On_Insulin_or_Diabetes_Meds + BMI + Height_cm + Creatinine + gh"
#"diabetes ~ Age + On_Insulin_or_Diabetes_Meds + BMI + Creatinine + gh"
mlr_full_4 = ols("diabetes ~ Age + On_Insulin_or_Diabetes_Meds + BMI + Creatinine + gh + BMI*Creatinine + On_Insulin_or_Diabetes_Meds*gh", dftrain2).fit()
print(mlr_full_4.summary())
print('\nMSE =', mlr_full_4.mse_resid)
OLS Regression Results
==============================================================================
Dep. Variable: diabetes R-squared: 0.624
Model: OLS Adj. R-squared: 0.624
Method: Least Squares F-statistic: 1191.
Date: Thu, 25 Jul 2024 Prob (F-statistic): 0.00
Time: 12:33:44 Log-Likelihood: 762.66
No. Observations: 5031 AIC: -1509.
Df Residuals: 5023 BIC: -1457.
Df Model: 7
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept -0.3860 0.041 -9.525 0.000 -0.465 -0.307
Age 0.0007 0.000 4.584 0.000 0.000 0.001
On_Insulin_or_Diabetes_Meds 1.2005 0.053 22.525 0.000 1.096 1.305
BMI 0.0016 0.001 1.295 0.195 -0.001 0.004
Creatinine 0.0215 0.036 0.590 0.555 -0.050 0.093
gh 0.0627 0.005 12.658 0.000 0.053 0.072
BMI:Creatinine -0.0001 0.001 -0.077 0.938 -0.003 0.003
On_Insulin_or_Diabetes_Meds:gh -0.0556 0.008 -7.257 0.000 -0.071 -0.041
==============================================================================
Omnibus: 3778.188 Durbin-Watson: 1.931
Prob(Omnibus): 0.000 Jarque-Bera (JB): 55347.109
Skew: 3.612 Prob(JB): 0.00
Kurtosis: 17.555 Cond. No. 1.23e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.23e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
MSE = 0.04330566828408332
# Note: On_Insulin_or_Diabetes_Meds is binary, so I(On_Insulin_or_Diabetes_Meds**2)
# duplicates the original column and makes the design matrix singular (see note [2] below).
mlr_full_5 = ols("diabetes ~ Age + I(On_Insulin_or_Diabetes_Meds**2) + BMI + Creatinine + gh + On_Insulin_or_Diabetes_Meds*gh", dftrain2).fit()
print(mlr_full_5.summary())
print('\nMSE =', mlr_full_5.mse_resid)
OLS Regression Results
==============================================================================
Dep. Variable: diabetes R-squared: 0.624
Model: OLS Adj. R-squared: 0.624
Method: Least Squares F-statistic: 1390.
Date: Thu, 25 Jul 2024 Prob (F-statistic): 0.00
Time: 12:35:42 Log-Likelihood: 762.66
No. Observations: 5031 AIC: -1511.
Df Residuals: 5024 BIC: -1466.
Df Model: 6
Covariance Type: nonrobust
=======================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------------
Intercept -0.3837 0.027 -14.130 0.000 -0.437 -0.330
Age 0.0007 0.000 4.585 0.000 0.000 0.001
I(On_Insulin_or_Diabetes_Meds ** 2) 0.6002 0.027 22.555 0.000 0.548 0.652
BMI 0.0015 0.000 3.363 0.001 0.001 0.002
Creatinine 0.0187 0.006 2.965 0.003 0.006 0.031
gh 0.0627 0.005 12.662 0.000 0.053 0.072
On_Insulin_or_Diabetes_Meds 0.6002 0.027 22.555 0.000 0.548 0.652
On_Insulin_or_Diabetes_Meds:gh -0.0556 0.008 -7.258 0.000 -0.071 -0.041
==============================================================================
Omnibus: 3778.285 Durbin-Watson: 1.931
Prob(Omnibus): 0.000 Jarque-Bera (JB): 55350.052
Skew: 3.612 Prob(JB): 0.00
Kurtosis: 17.555 Cond. No. 1.91e+17
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.33e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
MSE = 0.04329709988103107
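The singular design matrix has a simple cause: for a 0/1 indicator, squaring reproduces the column exactly, so the squared term is a duplicate predictor. A quick check:

```python
import numpy as np

# For any 0/1 indicator, x**2 == x, so the squared term adds a duplicate column.
x = np.array([0, 1, 1, 0, 1])
print((x ** 2 == x).all())  # → True
```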
res5 = mlr_full_5.resid
len(res5)
5031
sns.residplot(x=mlr_full_5.fittedvalues, y=res5)
plt.title('Residual Plot')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.show()
#linear pattern
# To inspect skewness, produce histograms
fig = plt.figure(figsize=(12,4))
ax = fig.add_subplot(121)
plt.hist(dftrain2.diabetes) # the response Y
plt.title('histogram of response');
ax = fig.add_subplot(122)
plt.hist(res5) # the residuals
plt.title('histogram of residual');
## Both plots are right-skewed
Using multiple linear regression for a binary response variable is not appropriate because it assumes a continuous response. For binary response data, logistic regression is typically used instead. Here's why:
Multiple Linear Regression
MLR assumes a continuous response, errors that are normally distributed with constant variance, and a linear relationship between the predictors and the response.
When the response variable is binary (e.g., 0 or 1), these assumptions are violated. Specifically:
- The response variable is not continuous.
- The error terms do not follow a normal distribution.
- The relationship between predictors and the binary response is not necessarily linear.
Logistic Regression
Logistic regression instead models the log-odds of the binary outcome as a linear function of the predictors, so the fitted probabilities always stay between 0 and 1.
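A minimal sketch of a logistic fit with statsmodels' formula API (toy data; on the real set the formula would use the same predictors as above):

```python
import pandas as pd
from statsmodels.formula.api import logit

# Logistic regression models log-odds, so fitted probabilities stay in (0, 1).
toy = pd.DataFrame({
    'x': [0, 1, 2, 3, 4, 5, 6, 7],
    'y': [0, 0, 0, 1, 0, 1, 1, 1],  # overlapping classes avoid perfect separation
})
model = logit('y ~ x', toy).fit(disp=0)
probs = model.predict(toy)
print(probs.round(2).tolist())  # probabilities increase with x
```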
dfx = pd.read_excel('Glycohemoglobin_t4_ShortName.xlsx')
# Split data set into 75:25.
from sklearn.model_selection import train_test_split
dfxtrain, dfxtest = train_test_split(dfx, test_size=0.25, random_state=1938)
# To confirm the split ratio:
print( len(dfxtrain)/len(dfx) )
print( len(dfxtest)/len(dfx) )
0.749963208241354
0.25003679175864607
formula = (
"gh ~ Sex + Age + Race_Or_Ethnicity + Income_Min + Income_Max + On_Insulin_or_Diabetes_Meds + Weight_Kg + Height_cm + BMI + Up_Leg_Len + Up_Arm_Len + Arm_Cir + Waist_Cir + Triceps_Sk + Subscapular_Sk + Alibumin + Blood_Urea + Creatinine + diabetes"
)
mlr_full_x = ols(formula, dfxtrain).fit()
print(mlr_full_x.summary())
print('\nMSE =', mlr_full_x.mse_resid)
OLS Regression Results
==============================================================================
Dep. Variable: gh R-squared: 0.395
Model: OLS Adj. R-squared: 0.391
Method: Least Squares F-statistic: 110.4
Date: Thu, 25 Jul 2024 Prob (F-statistic): 0.00
Time: 13:14:58 Log-Likelihood: -3442.7
No. Observations: 3238 AIC: 6925.
Df Residuals: 3218 BIC: 7047.
Df Model: 19
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 8.5258 0.976 8.737 0.000 6.612 10.439
Sex 0.1347 0.042 3.219 0.001 0.053 0.217
Age 0.0057 0.001 6.676 0.000 0.004 0.007
Race_Or_Ethnicity -0.0335 0.012 -2.907 0.004 -0.056 -0.011
Income_Min 1.151e-06 4.89e-06 0.236 0.814 -8.43e-06 1.07e-05
Income_Max -1.872e-06 3.98e-06 -0.470 0.638 -9.67e-06 5.93e-06
On_Insulin_or_Diabetes_Meds 1.0742 0.073 14.648 0.000 0.930 1.218
Weight_Kg 0.0101 0.006 1.590 0.112 -0.002 0.023
Height_cm -0.0122 0.006 -2.000 0.046 -0.024 -0.000
BMI -0.0523 0.018 -2.883 0.004 -0.088 -0.017
Up_Leg_Len -0.0095 0.006 -1.652 0.099 -0.021 0.002
Up_Arm_Len 0.0066 0.009 0.709 0.478 -0.012 0.025
Arm_Cir 0.0139 0.007 1.888 0.059 -0.001 0.028
Waist_Cir 0.0050 0.003 1.888 0.059 -0.000 0.010
Triceps_Sk -0.0076 0.003 -2.562 0.010 -0.013 -0.002
Subscapular_Sk 0.0127 0.003 5.008 0.000 0.008 0.018
Alibumin -0.3053 0.044 -6.909 0.000 -0.392 -0.219
Blood_Urea 9.234e-05 0.003 0.033 0.974 -0.005 0.006
Creatinine -0.1726 0.041 -4.245 0.000 -0.252 -0.093
diabetes 0.6090 0.061 10.041 0.000 0.490 0.728
==============================================================================
Omnibus: 2903.493 Durbin-Watson: 1.890
Prob(Omnibus): 0.000 Jarque-Bera (JB): 159056.156
Skew: 4.066 Prob(JB): 0.00
Kurtosis: 36.359 Cond. No. 5.01e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.01e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
MSE = 0.49397423456823825